IAGLR 2021 H2O Workshop

1 Agenda

  • Welcome and introductions – Timothy Maguire (UMich CIGLR) and Jo-fai Chow (H2O.ai) (5 mins)
  • Great Lakes research questions driven by data – Timothy Maguire (15 mins)
  • Introduction to the H2O.ai software – Jo-fai Chow (5 mins)
  • Live-code example of data manipulation and machine learning analysis – Jo-fai Chow (20 mins)
  • Results in context – Timothy Maguire (10 mins)
  • Q&A (5 mins)

2 Introduction to H2O

TBA

3 Software and Code

3.1 Code

3.2 R Packages

  • Check out setup.R
  • For this tutorial:
    • h2o for automatic and explainable machine learning
  • For RMarkdown:
    • knitr for rendering this RMarkdown
    • rmdformats for readthedown RMarkdown template
    • DT for nice tables
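If any of these packages are missing, they can be installed from CRAN first (a minimal sketch; the repository's setup.R may handle installation differently):

```r
# Install any packages used in this tutorial that are not already present
pkgs <- c("h2o", "knitr", "rmdformats", "DT")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```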

4 H2O Basics

# Let's go
library(h2o) # for H2O Machine Learning

4.1 Start a local H2O Cluster (JVM)

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 minutes 51 seconds 
    H2O cluster timezone:       Europe/London 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.32.1.2 
    H2O cluster version age:    14 days, 22 hours and 13 minutes  
    H2O cluster name:           H2O_started_from_R_joe_ljh865 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.33 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.5 (2021-03-31) 
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll()   # Optional: remove anything from previous session 
# Enter your lucky seed here ...
n_seed <- 12345
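The h2o.init() call above used the defaults. It also accepts options to size the cluster explicitly, which can help on larger datasets (a sketch; adjust the values to your machine):

```r
# Optional: start H2O with explicit resources instead of the defaults
# (only needed if you have not already started a cluster above)
# h2o.init(
#   nthreads = -1,        # use all available CPU cores
#   max_mem_size = "4G"   # cap the JVM heap size
# )
```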

5 Data - Dom Lake Huron Abund

lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')
datatable(as.data.frame(head(lake_data)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
h2o.describe(lake_data)
      Label Type Missing Zeros PosInf NegInf       Min        Max       Mean
1        C1  int       0     0      0      0    1.0000  885.00000  443.00000
2      Year  int       0     0      0      0 2006.0000 2012.00000 2009.00113
3    Region enum       0    96      0      0    0.0000    3.00000         NA
4   Station enum       0    24      0      0    0.0000  122.00000         NA
5 Replicate  int       0     1      0      0    0.0000    3.00000    1.99887
6       lat real       0     0      0      0   43.2695   46.23333   44.68487
        Sigma Cardinality
1 255.6217909          NA
2   2.3727809          NA
3          NA           4
4          NA         123
5   0.8190319          NA
6   0.8554837          NA
 [ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)

h2o.hist(lake_data$DBUG, breaks = 100)

5.1 Define Target and Features

target_DPOL <- "DPOL"
target_DBUG <- "DBUG"

# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))

print(features)
 [1] "Year"                  "Region"                "Station"              
 [4] "Replicate"             "lat"                   "lon"                  
 [7] "depth"                 "substrate"             "Season"               
[10] "Na2O"                  "Magnesium"             "MagnesiumOxide"       
[13] "AluminumOxide"         "SiliconDioxide"        "Quartz"               
[16] "PhosphorusPentoxide"   "Sulphur"               "DDT_TOTAL"            
[19] "P.P.TDE"               "P.P.DDE"               "P.P.DDE.1"            
[22] "HeptachlorEpoxide"     "Potassium"             "PotassiumOxide"       
[25] "Calcium"               "CalciumOxide"          "TitaniumDioxide"      
[28] "Chromium"              "Manganese"             "ManganeseOxide"       
[31] "XTR.Iron"              "Iron"                  "IronOxide"            
[34] "Cobalt"                "Nickel"                "Copper"               
[37] "Zinc"                  "Selenium"              "Strontium"            
[40] "Beryllium"             "Silver"                "Cadmium"              
[43] "Tin"                   "TotalCarbon"           "OrganicCarbon"        
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen"         "Mercury"              
[49] "Lead"                  "Uranium"               "Vanadium"             
[52] "Arsenic"               "Chloride"              "Fluoride"             
[55] "Sand"                  "Silt"                  "Clay"                 
[58] "Mean_grainsize"        "simpsonD"              "shannon"              
[61] "Chironomidae"          "Oligochaeta"           "Sphaeriidae"          
[64] "Diporeia"             
[Figure: machine learning workflow overview]

5.2 Split Data into Train/Test

h_split <- h2o.splitFrame(lake_data, ratios = 0.75, seed = n_seed)
h_train <- h_split[[1]] # 75% for modelling
h_test <- h_split[[2]] # 25% for evaluation
dim(h_train)
[1] 656  68
dim(h_test)
[1] 229  68

5.3 Cross-Validation

[Figure: k-fold cross-validation diagram]
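With nfolds = 5, H2O splits the training data into five folds and trains five auxiliary models, each validated on the fold it did not see. The partitioning idea can be sketched in base R (an illustration only, not H2O's internal implementation):

```r
# Sketch of 5-fold partitioning: each row is assigned to exactly one fold
set.seed(12345)
n_rows <- 20
fold <- sample(rep(1:5, length.out = n_rows))
for (k in 1:5) {
  train_idx <- which(fold != k)  # train on the other four folds
  valid_idx <- which(fold == k)  # validate on fold k
  # ... fit a model on train_idx, score it on valid_idx ...
}
```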

6 Worked Example - Target “DPOL”

6.1 Baseline Models

  • h2o.glm(): H2O Generalized Linear Model
  • h2o.randomForest(): H2O Random Forest Model
  • h2o.gbm(): H2O Gradient Boosting Model
  • h2o.deeplearning(): H2O Deep Neural Network Model
  • h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC

Let’s start with GBM

model_gbm_DPOL <- h2o.gbm(x = features,               # All 64 features
                              y = target_DPOL,            # Target variable: DPOL
                              training_frame = h_train,   # H2O dataframe with training data
                              nfolds = 5,                 # Using 5-fold CV
                              seed = n_seed)              # Your lucky seed
# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  4620.394
RMSE:  67.97348
MAE:  16.30104
RMSLE:  NaN
Mean Residual Deviance :  4620.394
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)
H2ORegressionMetrics: gbm

MSE:  2613.399
RMSE:  51.12142
MAE:  12.70523
RMSLE:  NaN
Mean Residual Deviance :  2613.399

Let’s use RMSE (root mean square error), the square root of the mean squared prediction error, as our evaluation metric. Lower is better.
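For reference, RMSE can be written as a one-line helper in base R (H2O computes it for you via h2o.rmse(); the example numbers below are arbitrary):

```r
# RMSE: the square root of the mean of the squared errors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), approximately 1.1547
```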

Build Other Baseline Models (GLM, DRF, DNN, XGBoost) - TRY IT YOURSELF!

# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)
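As one worked instance of the stubs above, a Random Forest baseline with the same cross-validation setup might look like this (a sketch, reusing features, target_DPOL, h_train, h_test and n_seed from earlier):

```r
# Random Forest baseline with the same 5-fold CV setup as the GBM
model_drf_DPOL <- h2o.randomForest(x = features,
                                   y = target_DPOL,
                                   training_frame = h_train,
                                   nfolds = 5,
                                   seed = n_seed)

# Compare against the GBM baseline on the test set (RMSE: lower = better)
h2o.rmse(h2o.performance(model_drf_DPOL, newdata = h_test))
```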

6.2 Manual Tuning

6.2.1 Check out the hyper-parameters for each algo

?h2o.glm 
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost

6.2.2 Train a GBM model with manual settings

model_gbm_DPOL_m <- h2o.gbm(x = features, 
                                y = target_DPOL, 
                                training_frame = h_train,
                                nfolds = 5,
                                seed = n_seed,
                                # Manual Settings based on experience
                                learn_rate = 0.1,       # learning rate (lower = more conservative)
                                ntrees = 120,           # more trees to compensate for a modest learn_rate
                                sample_rate = 0.7,      # use a random 70% of rows for each tree
                                col_sample_rate = 0.7)  # use a random 70% of columns

6.2.3 Comparison (RMSE: Lower = Better)

# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
                               "GBM: Gradient Boosting Model (Manual Settings)"),
                     stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA

# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))

# Show Comparison (RMSE: Lower = Better)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.3 H2O AutoML

# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
                          y = target_DPOL,
                          training_frame = h_train,
                          nfolds = 5,                        # 5-fold Cross-Validation
                          max_models = 30,                   # Max number of models
                          stopping_metric = "RMSE",          # Metric to optimize
                          exclude_algos = c("DeepLearning", "XGBoost"), # exclude some algos for a quick demo
                          seed = n_seed)

6.3.1 Leaderboard

datatable(as.data.frame(automl_DPOL@leaderboard),
          rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.3.2 Best Model (Leader)

automl_DPOL@leader
Model Details:
==============

H2ORegressionModel: gbm
Model ID:  GBM_grid__1_AutoML_20210514_152911_model_7 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              43                       43               10605         4
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         4    4.00000          9         16    13.79070


H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  742.9795
RMSE:  27.25765
MAE:  7.378285
RMSLE:  NaN
Mean Residual Deviance :  742.9795



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  4022.796
RMSE:  63.42551
MAE:  15.04154
RMSLE:  NaN
Mean Residual Deviance :  4022.796


Cross-Validation Metrics Summary: 
                            mean         sd cv_1_valid cv_2_valid cv_3_valid
mae                    15.045994  3.9332395   12.12232   11.87795  17.959179
mean_residual_deviance 4025.1875  2845.8782  2456.1912  2354.3364   8599.125
mse                    4025.1875  2845.8782  2456.1912  2354.3364   8599.125
r2                     0.5540781 0.14167649  0.5431397  0.5116641 0.44749096
residual_deviance      4025.1875  2845.8782  2456.1912  2354.3364   8599.125
rmse                     60.5994  21.002983  49.559975  48.521503   92.73147
rmsle                        NaN        0.0        NaN        NaN        NaN
                       cv_4_valid cv_5_valid
mae                     20.492651  12.777869
mean_residual_deviance   4990.901  1725.3843
mse                      4990.901  1725.3843
r2                     0.46939796  0.7986977
residual_deviance        4990.901  1725.3843
rmse                     70.64631  41.537746
rmsle                         NaN        NaN

6.3.3 Comparison (RMSE: Lower = Better)

d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
                         RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
                         RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)

datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.4 Make Predictions

yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)
head(yhat_test, 5)
     predict
1  35.388011
2   2.586987
3 577.323965
4   3.505565
5   4.594125
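To inspect the predictions next to the observed values, the two frames can be combined and pulled back into R (a sketch):

```r
# Side-by-side comparison of observed vs predicted DPOL on the test set
d_compare <- as.data.frame(h2o.cbind(h_test[, "DPOL"], yhat_test))
colnames(d_compare) <- c("observed", "predicted")
head(d_compare)
```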

6.5 Explainable AI - Target “DPOL”

6.5.1 Global Explanations

# Explain Best Model from AutoML
h2o.explain(automl_DPOL@leader, newdata = h_test)


Residual Analysis
=================

> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.



Variable Importance
===================

> The variable importance plot shows the relative importance of the most important variables in the model.



SHAP Summary
============

> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.



Partial Dependence Plots
========================

> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.



Individual Conditional Expectations
===================================

> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
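The panels that h2o.explain() produces can also be generated one at a time, which is handy for a single figure in a report (a sketch; "depth" is just an example feature from the list above):

```r
# Individual explanation plots, one panel at a time
h2o.varimp_plot(automl_DPOL@leader)                        # variable importance
h2o.pd_plot(automl_DPOL@leader, h_test, column = "depth")  # partial dependence
h2o.ice_plot(automl_DPOL@leader, h_test, column = "depth") # ICE curves
```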

6.5.2 Local Explanations

h2o.explain_row(automl_DPOL@leader, newdata = h_test, row_index = 1)


SHAP explanation
================

> SHAP explanation shows contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function. H2O implements TreeSHAP which when the features are correlated, can increase contribution of a feature that had no influence on the prediction.



Individual Conditional Expectations
===================================

> Individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response for a given row. ICE plot is similar to partial dependence plot (PDP), PDP shows the average effect of a feature while ICE plot shows the effect for a single instance.

6.5.3 Local Contributions (for Tree-based Models)

predictions <- h2o.predict(automl_DPOL@leader, newdata = h_test)
as.data.frame(head(predictions, 5))
     predict
1  35.388011
2   2.586987
3 577.323965
4   3.505565
5   4.594125
contributions <- h2o.predict_contributions(automl_DPOL@leader, newdata = h_test)

datatable(as.data.frame(head(contributions)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
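A useful sanity check: with TreeSHAP, the per-feature contributions plus the BiasTerm column should sum to the model's raw prediction for each row (a sketch, reusing contributions and predictions from above):

```r
# Per-row sum of contributions (including BiasTerm) vs the model prediction
contrib_df <- as.data.frame(head(contributions))
data.frame(sum_contrib = rowSums(contrib_df),
           prediction  = as.data.frame(head(predictions))$predict)
```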

7 Your Turn - Target “DBUG”

Get your hands dirty!
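A starter sketch for the DBUG target, following the same recipe used for DPOL above (everything here reuses objects defined earlier in the tutorial):

```r
# Repeat the AutoML workflow for the second target, DBUG
automl_DBUG <- h2o.automl(x = features,
                          y = target_DBUG,
                          training_frame = h_train,
                          nfolds = 5,
                          max_models = 30,
                          stopping_metric = "RMSE",
                          seed = n_seed)

# Inspect the leaderboard and evaluate the leader on the test set
automl_DBUG@leaderboard
h2o.rmse(h2o.performance(automl_DBUG@leader, newdata = h_test))
```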

8 Q & A